Goto

Collaborating Authors

 control signal







Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation

Neural Information Processing Systems

Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery, requiring the incorporation of additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network gControlNet, dedicated to the alignment and infusion of the control signals from disparate modalities into the pre-trained diffusion model.


Aligning Large Language Models with Representation Editing: A Control Perspective

Neural Information Processing Systems

Aligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabilities. To address these challenges, we propose aligning LLMs through representation editing. The core of our method is to view a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system. To achieve alignment for specific objectives, we introduce external control signals into the state space of this language dynamical system. We train a value function directly on the hidden states according to the Bellman equation, enabling gradient-based optimization to obtain the optimal control signals at test time. Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources compared to fine-tuning methods.


Error-Centric PID Untrained Neural-Net (EC-PIDUNN) For Nonlinear Robotics Control

Razzaq, Waleed

arXiv.org Artificial Intelligence

Classical Proportional-Integral-Derivative (PID) control has been widely successful across various industrial systems such as chemical processes, robotics, and power systems. However, as these systems evolved, the increase in the nonlinear dynamics and the complexity of interconnected variables have posed challenges that classical PID cannot effectively handle, often leading to instability, overshooting, or prolonged settling times. Researchers have proposed PIDNN models that combine the function approximation capabilities of neural networks with PID control to tackle these nonlinear challenges. However, these models require extensive, highly refined training data and have significant computational costs, making them less favorable for real-world applications. In this paper, We propose a novel EC-PIDUNN architecture, which integrates an untrained neural network with an improved PID controller, incorporating a stabilizing factor (\(τ\)) to generate the control signal. Like classical PID, our architecture uses the steady-state error \(e_t\) as input bypassing the need for explicit knowledge of the systems dynamics. By forming an input vector from \(e_t\) within the neural network, we increase the dimensionality of input allowing for richer data representation. Additionally, we introduce a vector of parameters \( ρ_t \) to shape the output trajectory and a \textit{dynamic compute} function to adjust the PID coefficients from predefined values. We validate the effectiveness of EC-PIDUNN on multiple nonlinear robotics applications: (1) nonlinear unmanned ground vehicle systems that represent the Ackermann steering mechanism and kinematics control, (2) Pan-Tilt movement system. In both tests, it outperforms classical PID in convergence and stability achieving a nearly critically damped response.


2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency

Yin, Xingxi, Li, Yicheng, Yan, Gong, Li, Chenglin, Zhao, Jian, Huang, Cong, Deng, Yue, Zhang, Yin

arXiv.org Artificial Intelligence

Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce \textbf{2K-Characters-10K-Stories}, a multi-modal stylized narrative dataset of \textbf{2{,}000} uniquely stylized characters appearing across \textbf{10{,}000} illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a \textbf{Human-in-the-Loop pipeline (HiL)} that leverages expert-verified character templates and LLM-guided narrative planning to generate highly-aligned structured data. A \textbf{decoupled control} scheme separates persistent identity from transient attributes -- pose and expression -- while a \textbf{Quality-Gated loop} integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieves performance comparable to closed-source models in generating visual narratives.


Monitor-Generate-Verify (MGV): Formalising Metacognitive Theory for Language Model Reasoning

Oh, Nick, Gobet, Fernand

arXiv.org Artificial Intelligence

Test-time reasoning architectures such as those following the Generate-Verify paradigm, where a model iteratively refines or verifies its own generated outputs, prioritise generation and verification but exclude the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss. We address this architectural gap by proposing the Monitor-Generate-Verify (MGV) framework, a computational translation of Flavell's and Nelson and Narens' metacognitive theories that preserves their psychological detail. MGV extends the Generate-Verify paradigm by adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Though we present no empirical validation, MGV provides a vocabulary for diagnosing component-level failures in reasoning systems, suggests specific architectural interventions for future designs, and identifies connections to resource-rational analysis that may ground its mechanisms in normative principles.